Mihai Ion
  • Home
  • Research
  • Teaching
  • CV
  1. FIN 525
  2. Lectures
  3. L11: Dates, lags, sorting
  • Teaching
  • FIN 525
    • Lectures
      • L00: Jupyter basics
      • L01: Introduction
      • L02: Variables, types, operators
      • L03: Data structures
      • L04: Conditionals, loops
      • L05: Functions, packages
      • L06: Pandas intro
      • L07: Pandas I/O
      • L08: Pandas filtering
      • L09: Pandas data cleaning
      • L10: Merging, reshaping datasets
      • L11: Dates, lags, sorting
      • L12: Descriptive stats
      • L13: Conditional stats, outliers
      • L14: Conditional stats applied
      • L15: Linear regression intro
      • L16: Linear regression applications
      • L17: Panel regression intro
      • L18: Robust panel regression
      • L19: Robust timeseries regression
      • L20: Backtesting - data prep
      • L21: Backtesting - sumstats
      • L22: Backtesting -returns
      • L23: Backtesting - risk adjustment
  • FIN 421
    • Lectures
      • L01: Introduction
      • L02: Analyzing past returns
      • L03: Modeling future returns
      • L04: Portfolio theory intro
      • L05: Optimal capital allocation
      • L06: Tangency portfolios
      • L07_08: Optimal asset allocation
      • L09: Review
      • L10_11: Statistical models of returns
      • L12: CAPM
      • L13: Cost of equity
      • L14: Bond pricing
      • L15: Bond yields
      • L16: Bond risk
      • L17: Valuation data processing
      • L18_19: Multiples valuation
      • L20_21: Dividend discount models
      • L22_23: Discounted cash flow analysis
      • L24: Valuation sensitivity analysis
      • L25: Options intro
      • L26: Risk management with options

On this page

  • Date formats: “pd.to_datetime()” and “dt.to_period()”
  • Sorting
    • .sort_values()
    • .sort_index()
  • Lagging and leading
    • Lagging and leading using .shift()
    • Leading and lagging with .merge()

L11: Dates, lags, sorting

import pandas as pd
import numpy as np
df = pd.read_excel('./rawdata12.xlsx')
df
firmid date return
0 1 09/30/2010 0.050
1 2 10/31/2010 0.450
2 3 11/30/2010 23.000
3 1 11/30/2010 0.870
4 2 11/30/2010 0.200
5 3 12/31/2010 0.340
6 1 12/31/2010 0.060
7 2 12/31/2010 0.001
8 3 12/31/2010 -0.120

Date formats: “pd.to_datetime()” and “dt.to_period()”

Pandas offers a lot of flexibility to manipulate dates and time stamps. Much of this functionality can only be used on columns that have the Pandas “datetime” data type. We can convert dates to this data type using the .to_datetime() function.

First, not that the date column in our dataframe is of type “object”:

df.dtypes
firmid      int64
date       object
return    float64
dtype: object

We’ll create a copy of the df dataframe to avoid changing the original data in df:

df2 = df.copy()
df2['dtdate'] = pd.to_datetime(df2['date'])
df2
firmid date return dtdate
0 1 09/30/2010 0.050 2010-09-30
1 2 10/31/2010 0.450 2010-10-31
2 3 11/30/2010 23.000 2010-11-30
3 1 11/30/2010 0.870 2010-11-30
4 2 11/30/2010 0.200 2010-11-30
5 3 12/31/2010 0.340 2010-12-31
6 1 12/31/2010 0.060 2010-12-31
7 2 12/31/2010 0.001 2010-12-31
8 3 12/31/2010 -0.120 2010-12-31
df2.dtypes
firmid             int64
date              object
return           float64
dtdate    datetime64[ns]
dtype: object

Now we can apply many useful date functions (they usually have the prefix dt.) to this datetime variable. For example, we can extract information about specific components of the date:

df2['year'] = df2['dtdate'].dt.year
df2['month'] = df2['dtdate'].dt.month
df2['day'] = df2['dtdate'].dt.day
df2
firmid date return dtdate year month day
0 1 09/30/2010 0.050 2010-09-30 2010 9 30
1 2 10/31/2010 0.450 2010-10-31 2010 10 31
2 3 11/30/2010 23.000 2010-11-30 2010 11 30
3 1 11/30/2010 0.870 2010-11-30 2010 11 30
4 2 11/30/2010 0.200 2010-11-30 2010 11 30
5 3 12/31/2010 0.340 2010-12-31 2010 12 31
6 1 12/31/2010 0.060 2010-12-31 2010 12 31
7 2 12/31/2010 0.001 2010-12-31 2010 12 31
8 3 12/31/2010 -0.120 2010-12-31 2010 12 31

Another common use of the .to_datetime() function is to construct a datetime variable from date components:

df2['newdate'] = pd.to_datetime(df2[['year','month','day']])
df2
firmid date return dtdate year month day newdate
0 1 09/30/2010 0.050 2010-09-30 2010 9 30 2010-09-30
1 2 10/31/2010 0.450 2010-10-31 2010 10 31 2010-10-31
2 3 11/30/2010 23.000 2010-11-30 2010 11 30 2010-11-30
3 1 11/30/2010 0.870 2010-11-30 2010 11 30 2010-11-30
4 2 11/30/2010 0.200 2010-11-30 2010 11 30 2010-11-30
5 3 12/31/2010 0.340 2010-12-31 2010 12 31 2010-12-31
6 1 12/31/2010 0.060 2010-12-31 2010 12 31 2010-12-31
7 2 12/31/2010 0.001 2010-12-31 2010 12 31 2010-12-31
8 3 12/31/2010 -0.120 2010-12-31 2010 12 31 2010-12-31

The other, very commonly used type for date data is the Pandas period format. This is used to specify that your data has a particular frequency, and can be done by applying the .to_period() function to a datetime variable (e.g. use ‘Y’ for yearly frequency data, ‘M’ for monthly, and ‘Q’ for quarterly):

df2['yearly'] = df2['dtdate'].dt.to_period('Y')
df2['monthly'] = df2['dtdate'].dt.to_period('M')
df2['quarterly'] = df2['dtdate'].dt.to_period('Q')
df2
firmid date return dtdate year month day newdate yearly monthly quarterly
0 1 09/30/2010 0.050 2010-09-30 2010 9 30 2010-09-30 2010 2010-09 2010Q3
1 2 10/31/2010 0.450 2010-10-31 2010 10 31 2010-10-31 2010 2010-10 2010Q4
2 3 11/30/2010 23.000 2010-11-30 2010 11 30 2010-11-30 2010 2010-11 2010Q4
3 1 11/30/2010 0.870 2010-11-30 2010 11 30 2010-11-30 2010 2010-11 2010Q4
4 2 11/30/2010 0.200 2010-11-30 2010 11 30 2010-11-30 2010 2010-11 2010Q4
5 3 12/31/2010 0.340 2010-12-31 2010 12 31 2010-12-31 2010 2010-12 2010Q4
6 1 12/31/2010 0.060 2010-12-31 2010 12 31 2010-12-31 2010 2010-12 2010Q4
7 2 12/31/2010 0.001 2010-12-31 2010 12 31 2010-12-31 2010 2010-12 2010Q4
8 3 12/31/2010 -0.120 2010-12-31 2010 12 31 2010-12-31 2010 2010-12 2010Q4

These type of period date are useful for many operations on the data, the most important one being that Pandas understand what you mean if you want to add or subtract some number of periods from a given date. For example:

df2['lastm'] = df2['monthly'] - 1
df2['nextq'] = df2['quarterly'] + 1
df2
firmid date return dtdate year month day newdate yearly monthly quarterly lastm nextq
0 1 09/30/2010 0.050 2010-09-30 2010 9 30 2010-09-30 2010 2010-09 2010Q3 2010-08 2010Q4
1 2 10/31/2010 0.450 2010-10-31 2010 10 31 2010-10-31 2010 2010-10 2010Q4 2010-09 2011Q1
2 3 11/30/2010 23.000 2010-11-30 2010 11 30 2010-11-30 2010 2010-11 2010Q4 2010-10 2011Q1
3 1 11/30/2010 0.870 2010-11-30 2010 11 30 2010-11-30 2010 2010-11 2010Q4 2010-10 2011Q1
4 2 11/30/2010 0.200 2010-11-30 2010 11 30 2010-11-30 2010 2010-11 2010Q4 2010-10 2011Q1
5 3 12/31/2010 0.340 2010-12-31 2010 12 31 2010-12-31 2010 2010-12 2010Q4 2010-11 2011Q1
6 1 12/31/2010 0.060 2010-12-31 2010 12 31 2010-12-31 2010 2010-12 2010Q4 2010-11 2011Q1
7 2 12/31/2010 0.001 2010-12-31 2010 12 31 2010-12-31 2010 2010-12 2010Q4 2010-11 2011Q1
8 3 12/31/2010 -0.120 2010-12-31 2010 12 31 2010-12-31 2010 2010-12 2010Q4 2010-11 2011Q1

Sorting

We can sort a dataframe based on the values in a particular column using the .sort_values() function. To sort based on the values in an index, we use the .sort_index() function.

.sort_values()

Syntax:

DataFrame.sort_values(by, axis=0, ascending=True, inplace=False)
df.sort_values('firmid')
firmid date return
0 1 09/30/2010 0.050
3 1 11/30/2010 0.870
6 1 12/31/2010 0.060
1 2 10/31/2010 0.450
4 2 11/30/2010 0.200
7 2 12/31/2010 0.001
2 3 11/30/2010 23.000
5 3 12/31/2010 0.340
8 3 12/31/2010 -0.120
df.sort_values('firmid', ascending = False)
firmid date return
2 3 11/30/2010 23.000
5 3 12/31/2010 0.340
8 3 12/31/2010 -0.120
1 2 10/31/2010 0.450
4 2 11/30/2010 0.200
7 2 12/31/2010 0.001
0 1 09/30/2010 0.050
3 1 11/30/2010 0.870
6 1 12/31/2010 0.060

Remember, functions that have an inplace parameter do not actually change the original dataset unless we set that parameter to True:

df
firmid date return
0 1 09/30/2010 0.050
1 2 10/31/2010 0.450
2 3 11/30/2010 23.000
3 1 11/30/2010 0.870
4 2 11/30/2010 0.200
5 3 12/31/2010 0.340
6 1 12/31/2010 0.060
7 2 12/31/2010 0.001
8 3 12/31/2010 -0.120
df.sort_values('firmid', inplace = True)
df
firmid date return
0 1 09/30/2010 0.050
3 1 11/30/2010 0.870
6 1 12/31/2010 0.060
1 2 10/31/2010 0.450
4 2 11/30/2010 0.200
7 2 12/31/2010 0.001
2 3 11/30/2010 23.000
5 3 12/31/2010 0.340
8 3 12/31/2010 -0.120

.sort_index()

Abbreviated syntax:

DataFrame.sort_index(axis=0, level=None, ascending=True, inplace=False)
df.sort_index()
firmid date return
0 1 09/30/2010 0.050
1 2 10/31/2010 0.450
2 3 11/30/2010 23.000
3 1 11/30/2010 0.870
4 2 11/30/2010 0.200
5 3 12/31/2010 0.340
6 1 12/31/2010 0.060
7 2 12/31/2010 0.001
8 3 12/31/2010 -0.120

Lagging and leading

In the context of most financial datasets which contain time-indexed information, lagging a particular variable (column) means obtaining the values for that particular variable from a prior point in time. Leading a variable means obtaining values from a future point in time.

If you research how to lead and lag variables in with pandas dataframes, most sources (including the official Pandas user guide) claims that you can do this using the .shift() function. In this section I first show you how to use this method, and when it runs into problems. Then show you how you can create leads and lags in a more robust way (using period dates and the merge function).

Lagging and leading using .shift()

Syntax:

DataFrame.shift(periods=1, freq=None, axis=0, fill_value=NoDefault.no_default)

Suppose we want to create a new column lag_return which tells us, for each firm, its returns from the prior month. The general advice you’ll see is that you should first sort your dataframe by firm identifier and by date:

df2 = df.sort_values(['firmid','date']).copy()
df2
firmid date return
0 1 09/30/2010 0.050
3 1 11/30/2010 0.870
6 1 12/31/2010 0.060
1 2 10/31/2010 0.450
4 2 11/30/2010 0.200
7 2 12/31/2010 0.001
2 3 11/30/2010 23.000
5 3 12/31/2010 0.340
8 3 12/31/2010 -0.120

And then use the .shift() function, after you tell Python that your dates are grouped at the firm level (each firm identifier has its own set of dates):

df2['lag_ret'] = df2.groupby('firmid')['return'].shift(1)
df2
firmid date return lag_ret
0 1 09/30/2010 0.050 NaN
3 1 11/30/2010 0.870 0.05
6 1 12/31/2010 0.060 0.87
1 2 10/31/2010 0.450 NaN
4 2 11/30/2010 0.200 0.45
7 2 12/31/2010 0.001 0.20
2 3 11/30/2010 23.000 NaN
5 3 12/31/2010 0.340 23.00
8 3 12/31/2010 -0.120 0.34

Note that the entries for lag_ret on the second row and the last row are not correct: they do not tell us what the return of the firm was in the prior month. This happens because

  1. Our data has gaps in coverage (October 2010 is missing for firmid==1)
  2. Our data has duplicates (there are two entries for December 2010 for firmid==3)

Note that both of these issues disappear if we first get rid of duplicates and if we reinterpret “lagging” to mean “the last available data point” not “data from the last calendar period” and “leading” to mean “the next available data point” and not “data from the following calendar period”. To keep things simple, this is the approach we will take in this course.

However, if this reinterpretation of lagging and leading is not exactly what you need for your application and you need to lag and lead in terms of calendar periods, you should follow the approach below:

Leading and lagging with .merge()

First, again, create a copy of the original dataset:

df2 = df.sort_values(['firmid','date']).copy()
df2
firmid date return
0 1 09/30/2010 0.050
3 1 11/30/2010 0.870
6 1 12/31/2010 0.060
1 2 10/31/2010 0.450
4 2 11/30/2010 0.200
7 2 12/31/2010 0.001
2 3 11/30/2010 23.000
5 3 12/31/2010 0.340
8 3 12/31/2010 -0.120

We will first create a new date variable that tells Python the frequency of our dates:

df2['mdate'] = pd.to_datetime(df2['date']).dt.to_period('M')
df2
firmid date return mdate
0 1 09/30/2010 0.050 2010-09
3 1 11/30/2010 0.870 2010-11
6 1 12/31/2010 0.060 2010-12
1 2 10/31/2010 0.450 2010-10
4 2 11/30/2010 0.200 2010-11
7 2 12/31/2010 0.001 2010-12
2 3 11/30/2010 23.000 2010-11
5 3 12/31/2010 0.340 2010-12
8 3 12/31/2010 -0.120 2010-12

Now create a new dataframe containing the firm identifiers (firmid), the period date (mdate) and the variable we want to lag (return):

lags = df2[['firmid','mdate','return']].copy()
lags
firmid mdate return
0 1 2010-09 0.050
3 1 2010-11 0.870
6 1 2010-12 0.060
1 2 2010-10 0.450
4 2 2010-11 0.200
7 2 2010-12 0.001
2 3 2010-11 23.000
5 3 2010-12 0.340
8 3 2010-12 -0.120

Now add (subtract in case of leads) the number of periods you want to lag the return variable (1 in our example) to the period date:

lags['mdate'] = lags['mdate'] + 1
lags
firmid mdate return
0 1 2010-10 0.050
3 1 2010-12 0.870
6 1 2011-01 0.060
1 2 2010-11 0.450
4 2 2010-12 0.200
7 2 2011-01 0.001
2 3 2010-12 23.000
5 3 2011-01 0.340
8 3 2011-01 -0.120

And rename return to lag_return:

lags = lags.rename(columns={'return':'lag_return'})
lags
firmid mdate lag_return
0 1 2010-10 0.050
3 1 2010-12 0.870
6 1 2011-01 0.060
1 2 2010-11 0.450
4 2 2010-12 0.200
7 2 2011-01 0.001
2 3 2010-12 23.000
5 3 2011-01 0.340
8 3 2011-01 -0.120

Finally, merge this lagged data into the original dataset:

df2 = df2.merge(lags, how='left', on=['firmid','mdate'])
df2.sort_values(['firmid','mdate'])
firmid date return mdate lag_return
0 1 09/30/2010 0.050 2010-09 NaN
1 1 11/30/2010 0.870 2010-11 NaN
2 1 12/31/2010 0.060 2010-12 0.87
3 2 10/31/2010 0.450 2010-10 NaN
4 2 11/30/2010 0.200 2010-11 0.45
5 2 12/31/2010 0.001 2010-12 0.20
6 3 11/30/2010 23.000 2010-11 NaN
7 3 12/31/2010 0.340 2010-12 23.00
8 3 12/31/2010 -0.120 2010-12 23.00

Note that now the second and last entry in lag_return are correct.

You still have to contend with what it means that you have duplicate entries for return for December 2010 for firmid==3 but dealing with duplicates needs to be addressed on a case by case basis, depending on the particulars of the data you are using.